pmartR PackageThis vignette describes the pmartR package functionality for quality control (QC) processing of mass spectrometry (MS) pan-omics data, in particular proteomics data at the peptide level. This includes data transformation, specification of groups that are to be compared against each other, filtering of features and/or samples, data normalization, and data summarization (correlation, principal components analysis). It is based on example data in the pmartRdata package.
Additional datasets available in the pmartRdata package include the following data types: protein level data, lipidomic data, and metabolomic data. A separate vignette describes the statistical analysis capabilities of pmartR.
The pmartR package relies on the creation of a standardized data object that is used for data manipulation. This data object requires specific formatting of input data and selection of data type to ensure that analyses are computed correctly. These data objects will be referred to overall as omicsData herein, where specific classes of omicsData will be referred to more specifically.
Below we load in 3 data.frames from the pmartRdata package. In this example, \(p\) is the number of unique peptides observed and \(n\) is the number of samples, respectively 17407 peptides and 12 samples.
pep_edata: A \(p * (n + 1)\) data.frame of expression data, where each row corresponds to data for each peptide and each column corresponds to a unique sample identifier. An additional peptide identifier/name column should also be present anywhere in the data.frame.
The example denotes the unique peptide identifier/name column by ‘Mass_Tag_ID’ and all other column names are unique sample identifiers.
pep_fdata: A data.frame with \(n\) rows. Each row corresponds to a sample with one column giving the unique sample identifiers found in e_data column names and other columns providing qualitative and/or quantitative traits of each sample (e.g. experimental group, sample metadata, patient demographics).
The example contains the sample identifier as “SampleID” and ‘Condition’ indicates the experimental group for each sample, either ‘Infection’ (samples that were infected with some pathogen) or ‘Mock’ (uninfected samples used as a control).
pep_emeta: An optional data.frame with at least \(p\) rows and a column with column name and entries identical to the unique identifier column in e_data. Each row corresponds to a peptide with one column giving peptide names and other columns giving meta information (e.g. mappings of peptides to proteins, class of molecule, pathways).
The example pairs the peptide identifiers (‘Mass_Tag_ID’, corresponding with the same column in edata) with the protein identification name to which each mass tag ID maps (‘Protein’), a reference identifier (‘Ref_ID’), and the peptide sequence (‘Peptide_Sequence’). Note that the mass tag IDs are unique (\(n_{peptide} =\) 17407) while the protein identifiers are not (\(n_{protein} =\) 2774) since multiple peptides often map to the same protein.
pep_edata
data("pep_edata") ## Load data.frame from pmartRdata
dim(pep_edata) ## Print dimensions of the data.frame (rows, columns)
pep_edata[1:6,1:5] ## Visualize the first 6 rows and first 5 columns of data
## [1] 17407 13
| Mass_Tag_ID | Infection1 | Infection2 | Infection3 | Infection4 |
|---|---|---|---|---|
| 1024 | NA | NA | NA | NA |
| 1047 | 17953839 | 20071472 | 20745779 | 18206556 |
| 1055 | 109536335 | 115459820 | 106127139 | 74522014 |
| 1104 | 1.752e+09 | 1.797e+09 | 1.703e+09 | 2.438e+09 |
| 1110 | 2571804 | 4269824 | 4852871 | 2630414 |
| 1214 | 110239193 | 82436688 | 100447189 | 1.02e+08 |
pep_fdata
data("pep_fdata") ## Load data.frame from pmartRdata
dim(pep_fdata) ## Print dimensions of the data.frame (rows, columns)
head(pep_fdata) ## Visualize the first 6 rows of data
## [1] 12 2
| SampleID | Condition |
|---|---|
| Infection1 | Infection |
| Infection2 | Infection |
| Infection3 | Infection |
| Infection4 | Infection |
| Infection5 | Infection |
| Infection6 | Infection |
pep_emeta
data("pep_emeta") ## Load data.frame from pmartRdata
dim(pep_emeta) ## Print dimensions of the data.frame (rows, columns)
head(pep_emeta) ## Visualize the first 6 rows of data
## [1] 17407 4
| Mass_Tag_ID | Protein | Ref_ID | Peptide_Sequence |
|---|---|---|---|
| 1024 | ACBP_HUMAN | 324 | K.QATVGDINTERPGMLDFTGK.A |
| 1047 | ALBU_HUMAN | 580 | T.ECCHGDLLECADDR.A |
| 1055 | RL40_HUMAN | 9898 | K.TITLEVEPSDTIENVK.A |
| 1104 | ALBU_HUMAN | 580 | K.KVPQVSTPTLVEVSR.N |
| 1110 | CYC_HUMAN | 2918 | K.TGPNLHGLFGR.K |
| 1214 | G3P_HUMAN | 4480 | K.VGVNGFGR.I |
Several different classes of omicData are available in pmartR depending on if the user is conducting proteomic, lipidomic, or metabolomic analysis. They are classified as follows:
Proteomics (Peptides) - as.pepData(), as.isobaricpepData()
Proteomics (Proteins) - as.proData
Lipidomics - as.lipidData
Metabolomics - as.metabData()
Each of these data types have example data also included in the pmartRdata package in analogous formats to the peptide data described above. Different classes permit other class specific functions available in pmartR, including quantification of peptide-protein mappings using proteomic_filter(), normalizing proteomic data with spans_procedure, and quantifying proteoforms with bpquant(). For our example, as.pepData() is the appropriate choice and we will demonstrate each of these functions.
Feature note: Sample names beginning with numbers or containing dashes are modified in R when processed by read.csv() or data.frame() functions. This modification replaces dashes with periods and pastes an “X” before column names, throwing an error in when using the above functions since column names of e_data are not identical to f_data sample identifiers. Passing the argument check.names = FALSE will circumvent this issue; an example below demonstrates this instance using metabolomic data.
edata <- read.csv(system.file("extdata",
"metab_edata_sample_names.csv",
package="pmartRdata"),
header=TRUE)
edata2 <- read.csv(system.file("extdata",
"metab_edata_sample_names.csv",
package="pmartRdata"),
header=TRUE, check.names=FALSE)
fdata <- read.csv(system.file("extdata",
"metab_fdata_sample_names.csv",
package="pmartRdata"),
header=TRUE)
all(names(edata)[-1] == fdata$SampleID) ## Do the sample names match each other?
## [1] FALSE
all(names(edata2)[-1] == fdata$SampleID) ## Do the sample names match each other?
## [1] TRUE
The pepData object contains the 3 components loaded above. During object creation, we specify the names of the identifier columns in the e_data, f_data, and e_meta components of the data, as well as the scale of the data (edata_cname = "Mass_Tag_ID", emeta_cname = "Protein", fdata_cname = "SampleID", data_scale = "abundance", respectively).
We can use the summary function to get basic summary information for our pepData object.
mypepData <- as.pepData(e_data = pep_edata,
edata_cname = "Mass_Tag_ID", ## Unique biomolecule identifier
f_data = pep_fdata,
fdata_cname = "SampleID", ## Unique sample identifier
e_meta = pep_emeta,
emeta_cname = "Protein", ## Grouping identifier of interest
data_scale = "abundance")
class(mypepData) ## pmartR class of data object
summary(mypepData) ## pmartR summary of data object
## [1] "pepData"
Alternatively, the pmartRdata package contains the corresponding pepData object, which can be loaded as follows:
data("pep_object") ## Load pre-rendered pmartR pepData object
class(pep_object) ## pmartR class of data object
## [1] "pepData"
rm(pep_object) ## remove data object (we use 'mypepData' herein)
We can use the plot() function to display box plots for each sample in the pepData object.
plot(mypepData)
Feature note: Long, inconvenient-for-plotting sample names can be modified using the function custom_sampnames(), where the identifiers can be trimmed at the user’s discretion (see “VizSampNames” column in code chunk below; all three commands generate the same trimming for the first 6 rows).
head(pmartR::custom_sampnames(mypepData, firstn = 3)$f_data)
head(pmartR::custom_sampnames(mypepData, from = 1, to = 3)$f_data)
head(pmartR::custom_sampnames(mypepData, delim = "e", components = 1)$f_data)
| SampleID | Condition | VizSampNames |
|---|---|---|
| Infection1 | Infection | Inf |
| Infection2 | Infection | Inf |
| Infection3 | Infection | Inf |
| Infection4 | Infection | Inf |
| Infection5 | Infection | Inf |
| Infection6 | Infection | Inf |
Once the pepData object is created, a typical QC workflow follows the figure below.
Figure 1. Quality control and processing workflow in pmartR package.
We will walk through the steps in this QC workflow using mypepData. Statistics are covered in a separate vignette, ‘Statistical Analysis with the pmartR Package’.
The formatting and preprocess steps of pmartR include zero-replacement, log transformation, and designating groups of interest to compare.
Sometimes omics abundance data contains 0s when the particular biomolecule was not detected for a given sample. It is our practice to replace any 0s with NAs as opposed to imputing the missing values. We strongly prefer NAs to imputation because we cannot know whether the biomolecule is not present in the sample, simply was below the limit of detection, or was present but missing from our data at random. The function edata_replace() can be used to replace values in the pepData object.
mypepData <- edata_replace(mypepData, x = 0, y = NA)
## 0 instances of 0 have been replaced with NA
In this dataset, there were no 0s so nothing was replaced.
We also highly recommend log transforming the data prior to analysis. pmartR supports log2, log10, and natural logarithm transformations. The edata_transform() function provides this capability. We can also use edata_trasnform() to transform the data back to the abundance scale if needed. Note that the scale of the data is stored and automatically updated in the data_info$data_scale attribute of the pepData object. Below we log10 transform the data, then return to the abundance scale, and finally settle on the log2 scale.
mypepData <- edata_transform(mypepData, data_scale = "log10")
attributes(mypepData)$data_info$data_scale
## [1] "log10"
mypepData <- edata_transform(mypepData, data_scale = "abundance")
attributes(mypepData)$data_info$data_scale
## [1] "abundance"
mypepData <- edata_transform(mypepData, data_scale = "log2")
attributes(mypepData)$data_info$data_scale
## [1] "log2"
plot(mypepData, bw_theme = TRUE)
Feature note: For isobaricpepData, please normalize using normalize_isobaric() before continuing. This function will normalize your data by a designated reference pool sample. If you have isobaric-labeled peptide data with no reference pool samples or have already accounted for reference samples, use as.pepData to read in your data.
Finally, we are preparing this data for statistical analysis where we will compare the samples belonging to one group to the samples belonging to another, and so we must specify the group membership of the samples. We do this using the group_designation() function, which modifies our pepData object and returns an updated version of it. Up to two main effects and up to two covariates may be specified (using covariates argument), with one main effect being required at minimum. The time_course argument permits users to designate a column containing time points. For the example data, we only have one option - specify the main effect to be “Condition” - since we do not have any additional information about the samples. Certain functions we will use below require that groups have been designated via the group_designation() function.
mypepData <- group_designation(mypepData, main_effects = "Condition",
covariates = NULL, time_course = NULL)
The group_designation() function creates an attribute of the dataset as follows:
attributes(mypepData)$group_DF
plot(mypepData, color_by = "Condition", bw_theme=TRUE)
Feature note: Plotting in pmartR is automated for each omicsData object; further customization is available using the arguments color_by and order_by, allowing the user to select a single column from f_data to highlight changes or arrange sample order in the plot. Default plotting theme can also be altered by bw_theme = TRUE, equivalent to the effects of ggplot2 function theme_bw().
It is often good practice to filter out biomolecules (peptides in this case) that do not meet certain criteria, and pmartR offers 5 different filters. These are applied as follows:
proteomics_filter() - Designed for peptide data with protein mappings in emeta; plot and summary indicate the number of peptides mapping to a protein and vice versa. User can designate a minimum number of peptides mapping to a protein when applying this filter as well as the inclusion/exclusion of degenerate peptides, defined as any peptide that maps to multiple proteins.
molecule_filter() - Applicable across all data types, filters out biomolecules that are observed in at least X samples, where X is an integer defined by the user.
cv_filter() - Applicable across all data types, filters biomolecules based on a user-designated pooled coefficient of variation. (Ahmed 1995)
custom_filter() - Applicable across all data types, filters biomolecules or samples based on a user-designated biomolecule names. This is best used for removing contaminants, but also also can remove botched sample results.
imdanova_filter() - Applicable across all data types, filters biomolecules based on the statistical requirements of user-designated statistical tests. These tests are an analysis of variance (ANOVA) to test for differences between means of designated groups and a G-test for the independence of missing data (Running group_designation is required before this filter can be used). User can select either or both to apply to the omicsData object. (Webb-Robertson et al. 2010)
Each of the filtering functions calculates metric(s) that can be used to filter out biomolecules and returns an S3 object separate from the omicsData object. Using the summary() function on the filter object produces a summary of the metric(s) and using the plot() function produces a graph. Filters that require a user-specified threshold to filter out peptides have corresponding summary and plot methods that take optional argument(s) to specify that threshold. Once one of the 5 peptide level filter functions has been called, the results of that function can be used in conjunction with the applyFilt() function to actually filter out peptides based on the metric(s) and user-specified threshold(s) and create a new, filtered data object.
The proteomics filter is applicable only to peptide level data that contains the e_meta component, as it counts the number of peptides that map to each protein and/or the number of proteins to which each individual peptide maps. It returns a list of two character vectors, the first, peptides_filt, giving degenerate peptide names. The second, proteins_filt, gives the names of proteins which no longer have peptides mapping to them in the dataset. This filter does not require a user-specified threshold.
myfilter <- proteomics_filter(mypepData)
summary(myfilter)
##
## Summary of Proteomics Filter
##
## Obs Proteins Per Peptide Obs Peptides Per Protein
## Min. 1 1
## 1st Qu. 1 1
## Median 1 3
## Mean 1 6.27505407354001
## 3rd Qu. 1 7
## Max. 1 333
##
## Filtered 0 0
## Not Filtered 17407 2774
plot(myfilter, bw_theme=TRUE)
With the current dataset, the majority of proteins have one peptide mapping to them (graph on the left) and all peptides map to a single protein (e.g. no degenerate peptides; graph on the right). If filtering is necessary, refer to below example code for removal of degenerate peptides and requiring at least 10 peptides mapping to each protein:
mypepData_temp <- applyFilt(filter_object = myfilter, mypepData,
degen_peps = TRUE,
min_num_peps = 10)
summary(mypepData_temp)
rm(mypepData_temp)
Filters that require a user-specified threshold to filter out peptides have corresponding summary and plot methods that take optional argument(s) to specify that threshold.
The molecule filter allows the user to remove from the dataset any biomolecule not seen in at least min_num of the samples. The user may specify a threshold of the minimum number of times each biomolecule must be observed across all samples; the default value is 2.
myfilter <- molecule_filter(mypepData)
summary(myfilter, min_num = 3)
##
## Summary of Molecule Filter
## ----------------------------------
## Minimum Number:3
## Filtered:3342
## Not Filtered:14065
##
## num_observations frequency_counts
## 1 1814
## 2 3342
## 3 4911
## 4 5934
## 5 6737
## 6 7530
## 7 8389
## 8 9279
## 9 10383
## 10 11527
## 11 12987
## 12 17407
plot(myfilter, bw_them = TRUE)
Setting the threshold to 3, we would filter 3,342 peptides out of the dataset. If we’d like to make the filter less stringent, we could use a threshold of 2 and only filter 1,872 peptides.
summary(myfilter, min_num = 2)
##
## Summary of Molecule Filter
## ----------------------------------
## Minimum Number:2
## Filtered:1814
## Not Filtered:15593
##
## num_observations frequency_counts
## 1 1814
## 2 3342
## 3 4911
## 4 5934
## 5 6737
## 6 7530
## 7 8389
## 8 9279
## 9 10383
## 10 11527
## 11 12987
## 12 17407
mypepData <- applyFilt(filter_object = myfilter, mypepData, min_num = 2)
summary(mypepData)
The coefficient of variation (CV) filter calculates the pooled CV values as in (Ahmed 1995).
The user can then specify a CV threshold, above which peptides are removed.
myfilter <- cv_filter(mypepData)
summary(myfilter, cv_threshold = 150)
##
## Summary of Coefficient of Variation (CV) Filter
## ----------------------
## CVs:
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.04778 24.85999 32.28337 34.83979 42.00244 168.69275
##
## Total NAs: 359
## Total Non-NAs: 15234
##
## Number Filtered Biomolecules: 2
plot(myfilter, cv_threshold = 150, title_size = 15, bw_theme = TRUE)
mypepData <- applyFilt(filter_object = myfilter, mypepData, cv_threshold = 150)
summary(mypepData)
Sometimes it is known a priori that certain peptides or samples should be filtered out of the dataset prior to statistical analysis. Perhaps there are known contaminant proteins, and so peptides mapping to them should be removed, or perhaps something went wrong during sample preparation for a particular sample. On the other hand, it may be preferred to specify peptides or samples to keep (removing those not explicitly specified), and this can also be accomplished. Keep in mind that both ‘remove’ and ‘keep’ arguments cannot be specified together; either ‘remove’ arguments only or ‘keep’ arguments only may be specified in a single call to custom_filter().
Here, we demonstrate the removal of the peptide with Mass Tag ID 1047 and sample Infection 1 as an example.
myfilter <- custom_filter(mypepData,
e_data_remove = 1047, e_meta_remove = NULL,
f_data_remove = "Infection1")
summary(myfilter)
##
## Summary of Custom Filter
##
## Filtered Remaining Total
## SampleIDs (f_data) 1 11 12
## Mass_Tag_IDs (e_data) 1 15590 15591
## Proteins (e_meta) 0 2511 2511
mypepData_temp <- applyFilt(filter_object = myfilter, mypepData)
summary(mypepData_temp)
rm(mypepData_temp)
We also demonstrate how to use this filter with the ‘keep’ option by keeping the samples Infection 1, Infection 2, Infection 3, Infection 4 and Infection 5.
myfilter2<- custom_filter(mypepData,
e_data_keep = NULL, e_meta_keep = NULL,
f_data_keep = c("Infection1",
"Infection2",
"Infection3",
"Infection4",
"Infection5"))
summary(myfilter2)
##
## Summary of Custom Filter
##
## Kept Discarded Total
## SampleIDs (f_data) 5 7 12
## Mass_Tag_IDs (e_data) 15591 0 15591
## Proteins (e_meta) 2511 0 2511
mypepData_temp2<- applyFilt(filter_object = myfilter2, mypepData)
summary(mypepData_temp2)
rm(mypepData_temp2)
Feature note: Note that there is a summary() method for objects of type custom_filt, but no plot() method.
The IMD-ANOVA filter removes peptides that do not have sufficient data for the statistical tests available in the pmartR package; these are ANOVA (quantitative test) and an independence of missing data (IMD) using a g-test (qualitative test) as specified in (Webb-Robertson et al. 2010). When using the summary(), plot(), and applyFilt() functions, you can specify just one filter (ANOVA or IMD) or both, depending on the tests you’d like to perform later. Using this filter speeds up the process of performing the statistical tests.
myfilter <- imdanova_filter(mypepData)
Here we consider what the filter would look like for both ANOVA and IMD.
summary(myfilter, min_nonmiss_anova = 2, min_nonmiss_gtest = 3)
##
## Summary of IMD Filter
##
## Total Observations: 15591
## Filtered: 2052
## Not Filtered: 13539
# plot(myfilter, min_nonmiss_anova = 2, min_nonmiss_gtest = 3)
Here we consider what the filter would look like for just ANOVA.
summary(myfilter, min_nonmiss_anova = 2)
##
## Summary of IMD Filter
##
## Total Observations: 15591
## Filtered: 5511
## Not Filtered: 10080
#plot(myfilter, min_nonmiss_anova = 2)
Here we consider what the filter would look like for just IMD.
summary(myfilter, min_nonmiss_gtest = 3)
##
## Summary of IMD Filter
##
## Total Observations: 15591
## Filtered: 2280
## Not Filtered: 13311
#plot(myfilter, min_nonmiss_gtest = 3)
Now we apply the filter for both ANOVA and IMD.
mypepData <- applyFilt(filter_object = myfilter,
mypepData, min_nonmiss_anova = 2,
min_nonmiss_gtest = 3)
summary(mypepData)
Sample filtering is encouraged in pmartR where samples are determined to be outliers, as these samples can impact further data analyses. Potential outlier detection is available in pmartR using the filter function rmd_filter(). Other functions in pmartR, dim_reduction() and cor_result() help the user determine if potential outliers are true outliers by visualizing groupings of samples and sample correlation trends.
To identify any samples that are potential outliers or anomalies (due to sample quality, preparation, or processing circumstances), we use a robust Mahalanobis distance (rMd) (Matzke et al. 2011) score based on 2-5 metrics. The possible metrics are:
Correlation
Proportion of data that is missing (“Proportion_Missing”)
Median absolute deviation (“MAD”)
Skewness
Kurtosis
The rMd score can be mapped to a p-value, and a p-value threshold used to identify potentially outlying samples. In general, for proteomics data we recommend using correlation, proportion missing, MAD, and skew. A plot of the rMd values for each sample is generated, and specifying a value for ‘pvalue_threshold’ results in a horizontal line on the plot, with samples above the line (null hypothesis that the sample is not an outlier can be rejected) slated for filtering at the given threshold.
myfilter <- rmd_filter(mypepData, metrics = c("Correlation",
"Proportion_Missing",
"MAD",
"Skewness"))
summary(myfilter, pvalue_threshold = 0.001)
##
## Summary of RMD Filter
## ----------------------
## P-values:
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000001 0.2686711 0.4152817 0.5050117 0.8306939 0.9707249
##
## Metrics Used: MAD, Skewness, Corr, Proportion_Missing
##
## Filtered Samples: Infection5, Infection8
plot(myfilter, pvalue_threshold = 0.001, bw_theme=TRUE)
Using the output from the summary() function, we can explore the outliers identified to see what metrics are driving their outlier-ness. Box plots for each metric are graphed, with the specified sample marked with an ‘X’.
plot(myfilter, sampleID = "Infection5", bw_theme=TRUE)
plot(myfilter, sampleID = "Infection8", bw_theme=TRUE)
We can see that Infection5 has the highest MAD, low skewness, and almost the lowest correlation, but the lowest proportion missing compared to the other samples. Infection 8 has MAD close to the median MAD value, the highest skew, the lowest correlation, and the highest proportion of missing data. We can use this information to determine whether to remove either or both of these samples. Sometimes it is also useful to look at additional data summaries to inform outlier removal, such as a principal components plot or correlation heat map.
The pmartR package contains various methods for data summarization and exploration that can be used as part of the QC process: numeric summaries and associated plots (via the edata_summary() function), sequential projection pursuit principal components analysis (sppPCA) for dimension reduction (via thedim_reduction() function), and a correlation heat map (via the cor_result() function) (Stacklies et al. 2007), (Webb-Robertson et al. 2013). As for missing data summarization there are also functions to collect and plot missing data information. Missing data collection and plotting functions include, missingval_result(), plot_missingval(), missingval_scatterplot() and missingval_heatmap().
We can generate numeric summaries of our data by either sample or molecule. The argument groupvar can be used to specify if results should be split across an experimental group column. The edata_summary() function computes the mean, standard deviation, median, percent of observations for which a value was observed, the minimum value, and the maximum value.
The output of this function is a list of data.frames; we merge the list output here for conciseness.
listofsummary <- edata_summary(omicsData = mypepData, by = "sample", groupvar = NULL)
# Veiw as a single data.frame for vizualizing in rmd
dfsummary <- listofsummary[[1]]
for(i in 2:length(listofsummary)){
dfsummary <- merge(dfsummary,listofsummary[[i]])
}
print(head(dfsummary))
## sample mean sd median pct_obs min max
## 1 Infection1 21.78491 2.414556 21.75849 0.7965138 12.09441 30.94101
## 2 Infection2 21.89052 2.312247 21.83069 0.7900140 12.53455 30.92200
## 3 Infection3 21.85070 2.350433 21.79653 0.7303346 13.53600 30.94588
## 4 Infection4 21.81850 2.446171 21.79627 0.6917054 12.19476 31.18318
## 5 Infection5 22.22338 2.553529 22.26306 0.8023488 11.82416 31.74673
## 6 Infection6 21.87246 2.347379 21.80666 0.7264938 12.67596 31.37799
Probabilistic is a principal components analysis (PCA) algorithm that can be used in the presence of missing data (Stacklies et al. 2007), (Webb-Robertson et al. 2013). We use the dim_reduction() function in pmartR, specifying the dataset and the number of principal components to return (default value of 2 principal components). The summary() and plot() functions operate on the results of the dim_reduction() function.
Note that dim_reduction() requires that the group_designation() function has been run.
pca_results <- dim_reduction(mypepData, k = 2)
# summary(pca_results)
plot(pca_results, bw_theme=TRUE)
Pearson correlation between samples is calculated based on pairwise complete biomolecules.
correlation_results <- cor_result(mypepData)
summary(correlation_results)
## Summary of Correlation Matrix (Upper Triangle)
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.7804 0.8688 0.9143 0.9032 0.9496 0.9722
plot(correlation_results)
Patterns of missing data can be explored using the missingval_result() and plot_missingval() functions.
results <- missingval_result(mypepData)
plot(results, type = "bySample", bw_theme=TRUE)
plot(results, type = "byMolecule", bw_theme=TRUE)
In addition, the missingval_scatterplot() and missingval_heatmap() functions provide more views of the missing data.
missingval_scatterplot(mypepData, palette = "Set1", bw_theme=TRUE)
missingval_heatmap(mypepData)
The next step in a typical workflow is to normalize the data. Normalization can help reduce background noise while maintaining significant trends in datasets and often helps meet assumptions for statistical analysis.
Normalization approaches consist of a subset method and a normalization function (Åstrand 2003), (Gautier et al. 2004). Available subset methods include:
Percentage of peptides present (PPP): Subset the data to peptides present in at least \(p\) percent of samples. This subset method is specified as “ppp”.
Rank invariant peptides (RIP): First subset peptides to those present in every sample (e.g. complete peptides). Next, subject each peptide to a Kruskal Wallis test on group, and those peptides not significant at a given p-value threshold are retained as invariant. This subset method is specified as “rip”.
PPP-RIP: Rank invariant peptides among peptides present in a given percentage of samples. This subset method is specified as “ppp_rip”.
Top “l” order statistics (LOS): The peptides with intensities in the top “l” order statistics are retained. This subset method is specified as “los”.
Available normalization functions include:
Median centering: The sample-wise median (median of all peptides in a given sample) is subtracted from each peptide in the corresponding sample. This normalization method is specified as “median”.
Mean centering: The sample-wise mean (mean of all peptides in a given sample) is subtracted from each peptide in the corresponding sample. This normalization method is specified as “mean”.
Z-score transformation: The sample-wise mean (mean of all peptides in a given sample) is subtracted from each peptide, and the result is divided by the sample-wise standard deviation (standard deviation of all peptides in a given sample) in the corresponding sample. This normalization method is specified as “zscore”.
Median absolute deviation (MAD) transformation: The sample-wise median (median of all peptides in a given sample) is subtracted from each peptide, and the result is divided by the sample-wise MAD (MAD of all peptides in a given sample) in the corresponding sample. This normalization method is specified as “mad”.
There are two ways to go about normalizing data in the pmartR package. If you know the normalization approach you’d like to use, you can directly specify it in the normalize_data() function. Or, if you want to see how compatible different normalization approaches are with your dataset and you’re using proteomics data, you can use the spans_procedure() function.
If we know what approach we want to use, we can go straight to the normalization.
# Normalize using all peptides and median centering #
norm_object <- normalize_global(omicsData = mypepData,
subset_fn = "all",
norm_fn = "median",
apply_norm = FALSE,
backtransform = TRUE)
norm_data <- normalize_global(omicsData = mypepData,
subset_fn = "all",
norm_fn = "median",
apply_norm = TRUE,
backtransform = TRUE)
plot(norm_data,
title_plot="Normalized Data: Median Centering Using all Peptides",
title_size=12,
bw_theme = TRUE)
# Normalize using RIP 0.2 and median centering #
norm_object <- normalize_global(omicsData = mypepData,
subset_fn = "rip",
params=list(rip=0.2),
norm_fn = "median",
apply_norm = FALSE,
backtransform = TRUE)
norm_data <- normalize_global(omicsData = mypepData,
subset_fn = "rip",
params=list(rip=0.2),
norm_fn = "median",
apply_norm = TRUE,
backtransform = TRUE)
plot(norm_data,
title_plot="Normalized Data: Median Centering Using RIP 0.2 Peptides",
title_size=12,
bw_theme = TRUE)
This is the same function that would be used post-SPANS if the user chooses to run SPANS on proteomic data.
If using proteomics data and unsure how to normalize, the spans_procedure() function will generate statistics the user can use to select a method. SPANS ranks different combinations of sub-setting and normalization methods based on a ‘score’ that captures how much bias a particular normalization procedure introduces into the data. A higher ‘score’ implies less bias. (Webb-Robertson et al. 2011)
SPANS tests bias for all the aforementioned sub-setting and normalization procedures by default. Default settings are used to determine the best normalization procedure for mypepData.
# returns a dataframe arranged by descending SPANS score
spans_result <- spans_procedure(mypepData)
## [1] "Finished creating background distribution, moving to method candidate selection"
## [1] "Finished method candidate selection, proceeding to score selected methods."
## [1] "Finished scoring selected methods"
summary(spans_result)
##
## Summary of spans procedure
##
## Highest ranked method(s)
## subset_method normalization_method SPANS_score parameters
## 1 rip median 0.999 0.1
## 2 rip median 0.999 0.15
## 3 rip median 0.999 0.2
## 4 rip median 0.999 0.25
## 5 ppp_rip median 0.999 0.1;0.1
## 6 ppp_rip median 0.999 0.25;0.15
## mols_used_in_norm passed_selection rank
## 1 1554 TRUE 1
## 2 1369 TRUE 1
## 3 1158 TRUE 1
## 4 987 TRUE 1
## 5 3195 TRUE 1
## 6 3195 TRUE 1
##
## Number of input methods: 68
## Number of methods scored: 30
## Average molecules used in normalization: 5277
plot(spans_result)
# a list of the parameters for any normalization procedure with the best SPANS score
best_params <- get_spans_params(spans_result)
# there are a few ties, all using ppp_rip
# we'll select the method that uses median normalization and parameters ppp = 0.1 and rip = 0.1
subset_fn = best_params[[1]]$subset_fn
norm_fn = best_params[[1]]$norm_fn
params = best_params[[1]]$params
# we now pass the extracted subset function, normalization function, and parameters from SPANS to normalize_global()
norm_object <- normalize_global(omicsData = mypepData,
subset_fn = subset_fn,
norm_fn = norm_fn,
params = params,
apply_norm = TRUE,
backtransform = TRUE)
Feature note: SPANS is restricted to peptide and protein data and works best when large datasets are available due to data sub-setting (though it does take some time). Errors may generate for certain procedures if the dataset size is insufficient. To fix this issue, try reducing the number of subset functions that are tested.
Protein quantification can be done using the protein_quant() function, either with or without accounting for different isoforms of the proteins, also called ‘proteoforms’. Regardless of whether the user is accounting for protein isoforms, they must specify a method for rolling peptides up to proteins (one of “rollup”, “rrollup”, “qrollup”, or “zrollup”) and a function to use for combining the peptide-level data (either “mean” or “median”). When the user does wish to account for protein isoforms, they have the option for doing so with Bayesian Proteoform Quantification (BP-Quant) (Webb-Robertson et al. 2014).
Proteoforms (including protein post-translational modification, splice variants, etc.) are important factors to consider when assessing the biological impact of experimental groups. BP-Quant predicts any proteoforms present for each protein group by identifying trends from quantitative (ANOVA) and qualitative (G-test for independence of missing data) tests across experimental group comparisons. These relationships are summarized in the format of a vector of “signatures”, where 0 indicates no statistically significant difference between experimental groups and 1 or -1 respectively represent a statistically significant increased or decreased value between experimental groups. BPQuant uses these signatures in the Bayesian model to select the set of proteoforms with the maximum likelihood for each protein (Webb-Robertson et al. 2014).
In pmartR, this method is available using the function bpquant(), demonstrated below. Note that this function takes advantage of statistical results generated from pmartR using the function imdanova(), further covered in the vignette ‘Statistical Analysis with the pmartR Package’.
The output of this function produced a list, where each element contains a data.frame defining proteoform identifiers for each peptide mapping to that protein. Where proteoform ID is zero, the peptide does not fit in the dominant trends identified for that protein.
peptideStats <- imd_anova(norm_object,
test_method = "combined",
pval_thresh = 0.05,
pval_adjust = 'bonferroni')
proteoforms <- bpquant(peptideStats, norm_object)
proteoforms[[49]] ## An example of an output element
Feature note: Where there is insufficient data for either statistical test, the peptide abundances cannot be converted to a signature. The imdanova_filter will help to determine if this is the case for any dataset!
The rollup method takes either the mean or median of all peptides mapping to a given protein, and sets that value as the relative protein abundance.
In the rrollup method, all peptides that map to a single protein are scaled based on a chosen reference peptide, which is the peptide with most observations. Next the average or mean of the scaled peptides is set as the relative protein abundance. (Matzke et al. 2013)
The qrollup method starts with all peptides that map to a single protein. Next peptides are chosen according to an abundance cutoff value and the average or mean of the scaled peptides is set as the protein abundance. (Polpitiya et al. 2008)
In the zrollup method, scaling is done similarly to the z-score formula (except with medians instead of means). The scaling formula is applied to peptides that map to a single protein and then the mean or median of the scaled peptides is set as protein abundance (from DAnTE article). The rollup method is similar to rrollup method, except there is no scaling involved in these methods. Either the mean or median is applied to all peptides that map to a single protein to obtain protein abundance. (Polpitiya et al. 2008)
The protein_quant() function returns a pmartR data object of class proData.
myprodata_rrollup<- pmartR::protein_quant(pepData = norm_object, method = "rrollup")
print(myprodata_rrollup)
## only first 5 columns are shown
## e_data
## Protein Infection1 Infection2
## 1 ALBU_HUMAN 30.7284373069703 30.6093269371943
## 2 RL40_HUMAN 23.8194727741833 23.7523221873391
## 3 CYC_HUMAN 22.4898474100337 22.3580306811018
## 4 G3P_HUMAN 28.8165894353077 28.4266412920063
## 5 ---- ---- ----
## 2250 GSDMB_HUMAN <NA> <NA>
## 2251 gi|29836496|ref|NP_828851.1| 25.7649605608504 25.6610986933972
## 2252 H1N1_CA04_NP <NA> 19.431129958381
## 2253 NXPH2_HUMAN 19.6465108627434 17.6474393552717
## Infection3 Infection4
## 1 30.6317989290622 31.2311848782248
## 2 23.8767966168028 23.6976560416636
## 3 22.6535079624622 21.9723492626176
## 4 28.580536128935 28.6681866136105
## 5 ---- ----
## 2250 20.2167998857096 <NA>
## 2251 25.6180431466804 25.5663956728011
## 2252 <NA> <NA>
## 2253 <NA> 18.46654128675
##
## f_data
## SampleID Condition
## 1 Infection1 Infection
## 2 Infection2 Infection
## 3 Infection3 Infection
## 4 Infection4 Infection
## 5 ---- ----
## 9 Infection9 Infection
## 10 Mock1 Mock
## 11 Mock2 Mock
## 12 Mock3 Mock
##
## e_meta
## Protein Ref_ID Peptide_Sequence
## 1 ALBU_HUMAN 580 T.ECCHGDLLECADDR.A
## 2 RL40_HUMAN 9898 K.TITLEVEPSDTIENVK.A
## 3 CYC_HUMAN 2918 K.TGPNLHGLFGR.K
## 4 G3P_HUMAN 4480 K.VGVNGFGR.I
## 5 ---- ---- ----
## 2250 GSDMB_HUMAN 4995 K.CLGKEDIRQDLEQR.V
## 2251 gi|29836496|ref|NP_828851.1| 22574 K.EIDRLNEVAK.N
## 2252 H1N1_CA04_NP 22287 R.EGYSLVGIDPFK.L
## 2253 NXPH2_HUMAN 8006 K.ICYQEQTQSHVSWLCSK.P
## peps_per_pro n_peps_used
## 1 5 5
## 2 8 8
## 3 6 6
## 4 49 49
## 5 ---- ----
## 2250 1 1
## 2251 2 2
## 2252 1 1
## 2253 1 1
rm(myprodata_rrollup)
Ahmed, SE. 1995. “A Pooling Methodology for Coefficient of Variation.” Sankhyā: The Indian Journal of Statistics, Series B, 57–75.
Åstrand, Magnus. 2003. “Contrast Normalization of Oligonucleotide Arrays.” Journal of Computational Biology 10 (1): 95–102.
Gautier, Laurent, Leslie Cope, Benjamin M Bolstad, and Rafael A Irizarry. 2004. “Affy—Analysis of Affymetrix Genechip Data at the Probe Level.” Bioinformatics 20 (3): 307–15.
Matzke, Melissa M, Joseph N Brown, Marina A Gritsenko, Thomas O Metz, Joel G Pounds, Karin D Rodland, Anil K Shukla, et al. 2013. “A Comparative Analysis of Computational Approaches to Relative Protein Quantification Using Peptide Peak Intensities in Label-Free Lc-Ms Proteomics Experiments.” Proteomics 13 (3-4): 493–503.
Matzke, Melissa M, Katrina M Waters, Thomas O Metz, Jon M Jacobs, Amy C Sims, Ralph S Baric, Joel G Pounds, and Bobbie-Jo M Webb-Robertson. 2011. “Improved Quality Control Processing of Peptide-Centric Lc-Ms Proteomics Data.” Bioinformatics 27 (20): 2866–72.
Polpitiya, Ashoka D, Wei-Jun Qian, Navdeep Jaitly, Vladislav A Petyuk, Joshua N Adkins, David G Camp, Gordon A Anderson, and Richard D Smith. 2008. “DAnTE: A Statistical Tool for Quantitative Analysis of-Omics Data.” Bioinformatics 24 (13): 1556–8.
Stacklies, Wolfram, Henning Redestig, Matthias Scholz, Dirk Walther, and Joachim Selbig. 2007. “PcaMethods—a Bioconductor Package Providing Pca Methods for Incomplete Data.” Bioinformatics 23 (9): 1164–7.
Webb-Robertson, Bobbie-Jo M, Melissa M Matzke, Susmita Datta, Samuel H Payne, Jiyun Kang, Lisa M Bramer, Carrie D Nicora, et al. 2014. “Bayesian Proteoform Modeling Improves Protein Quantification of Global Proteomic Measurements.” Molecular & Cellular Proteomics 13 (12): 3639–46.
Webb-Robertson, Bobbie-Jo M, Melissa M Matzke, Jon M Jacobs, Joel G Pounds, and Katrina M Waters. 2011. “A Statistical Selection Strategy for Normalization Procedures in Lc-Ms Proteomics Experiments Through Dataset-Dependent Ranking of Normalization Scaling Factors.” Proteomics 11 (24): 4736–41.
Webb-Robertson, Bobbie-Jo M, Melissa M Matzke, Thomas O Metz, Jason E McDermott, Hyunjoo Walker, Karin D Rodland, Joel G Pounds, and Katrina M Waters. 2013. “Sequential Projection Pursuit Principal Component Analysis–Dealing with Missing Data Associated with New-Omics Technologies.” Biotechniques 54 (3): 165–68.
Webb-Robertson, Bobbie-Jo M, Lee Ann McCue, Katrina M Waters, Melissa M Matzke, Jon M Jacobs, Thomas O Metz, Susan M Varnum, and Joel G Pounds. 2010. “Combined Statistical Analyses of Peptide Intensities and Peptide Occurrences Improves Identification of Significant Peptides from Ms-Based Proteomics Data.” Journal of Proteome Research 9 (11): 5748–56.